Distributed Data Placement via Graph Partitioning

Authors

  • Lukasz Golab
  • Marios Hadjieleftheriou
  • Howard J. Karloff
  • Barna Saha
Abstract

With the widespread use of shared-nothing clusters of servers, there has been a proliferation of distributed object stores that offer high availability, reliability and enhanced performance for MapReduce-style workloads. However, relational workloads cannot always be evaluated efficiently using MapReduce without extensive data migrations, which cause network congestion and reduced query throughput. We study the problem of computing data placement strategies that minimize the data communication costs incurred by typical relational query workloads in a distributed setting. Our main contribution is a reduction of the data placement problem to the well-studied problem of GRAPH PARTITIONING, which is NP-hard but for which efficient approximation algorithms exist. The novelty and significance of this result lie in representing the communication cost exactly and using standard graphs instead of hypergraphs, which were used in prior work on data placement that optimized for different objectives (not communication cost). We study several practical extensions of the problem: with load balancing, with replication, with materialized views, and with complex query plans consisting of sequences of intermediate operations that may be computed on different servers. We provide integer linear programs (IPs) that may be used with any IP solver to find an optimal data placement. For the no-replication case, we use publicly available graph partitioning libraries (e.g., METIS) to efficiently compute nearly-optimal solutions. For the versions with replication, we introduce two heuristics that utilize the GRAPH PARTITIONING solution of the no-replication case. On the TPC-DS workload, an IP solver may take weeks to compute an optimal data placement, whereas our reduction produces nearly-optimal solutions in seconds.
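As a rough illustration of the idea in the abstract, the sketch below treats data objects as graph vertices and places them on servers so that the total weight of edges crossing servers is minimized, subject to a per-server capacity. This is a simplified toy, not the authors' reduction: the edge weights, the object names, the capacity rule, and the functions `cut_cost` and `best_placement` are all assumptions made for the example, and the exhaustive search stands in for a real partitioner such as METIS.

```python
from itertools import product


def cut_cost(edges, assignment):
    """Sum the weights of edges whose endpoints sit on different servers."""
    return sum(w for (u, v), w in edges.items() if assignment[u] != assignment[v])


def best_placement(objects, edges, num_servers, capacity):
    """Exhaustively search all placements; only feasible for tiny inputs."""
    best = None
    for choice in product(range(num_servers), repeat=len(objects)):
        assignment = dict(zip(objects, choice))
        # Count how many objects land on each server (toy load-balancing rule).
        loads = [0] * num_servers
        for server in assignment.values():
            loads[server] += 1
        if max(loads) > capacity:
            continue
        cost = cut_cost(edges, assignment)
        if best is None or cost < best[0]:
            best = (cost, assignment)
    return best


# Hypothetical workload: an edge weight stands for the data that must be
# shipped whenever the two objects it joins are stored on different servers.
objects = ["A", "B", "C", "D"]
edges = {("A", "B"): 10, ("C", "D"): 8, ("B", "C"): 1}

cost, placement = best_placement(objects, edges, num_servers=2, capacity=2)
print(cost, placement)
```

In this toy instance the heavily co-accessed pairs (A, B) and (C, D) end up co-located, and only the light B to C edge is cut. The exhaustive search grows exponentially with the number of objects, which is why the abstract points to approximation algorithms and partitioning libraries for realistic workloads.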

Similar resources

Graph Partitioning via Parallel Submodular Approximation to Accelerate Distributed Machine Learning

Distributed computing excels at processing large scale data, but the communication cost for synchronizing the shared parameters may slow down the overall performance. Fortunately, the interactions between parameter and data in many problems are sparse, which admits efficient partition in order to reduce the communication overhead. In this paper, we formulate data placement as a graph partitioni...

A Scalable Distributed Graph Partitioner

We present Scalable Host-tree Embeddings for Efficient Partitioning (Sheep), a distributed graph partitioning algorithm capable of handling graphs that far exceed main memory. Sheep produces high quality edge partitions an order of magnitude faster than both state of the art offline (e.g., METIS) and streaming partitioners (e.g., Fennel). Sheep’s partitions are independent of the input graph di...

Distributed Graph-Partitioning based Coalition Formation for Collaborative Multi-Agent Systems Some Lessons Learned and Challenges Ahead

We study algorithms for distributed collaborative multi-agent coalition formation. The focus of our recent and ongoing research has been on coalition formation via scalable distributed graph partitioning of the underlying agents’ communication network topology. In that endeavor, we have been analyzing, simulating and optimizing our original graph partitioning algorithm called Maximal Clique bas...

Scalable Linked Data Stream Processing via Network-Aware Workload Scheduling

In order to cope with the ever-increasing data volume, distributed stream processing systems have been proposed. To ensure scalability most distributed systems partition the data and distribute the workload among multiple machines. This approach does, however, raise the question how the data and the workload should be partitioned and distributed. A uniform scheduling strategy—a uniform distribu...

Optimal Placement and Sizing of Distributed Generation Via an Improved Nondominated Sorting Genetic Algorithm II

The use of distributed generation units in distribution networks has attracted the attention of network managers due to its great benefits. In this research, the location and determination of the capacity of distributed generation (DG) units for different purposes has been studied simultaneously. The multi-objective functions in the optimization model are reducing system line losses; reducing v...

Journal:
  • CoRR

Volume abs/1312.0285

Publication date: 2013